Introduction

In this R Markdown document, we analyze various attributes of red wine and how they influence its quality.

The red wine data set contains 1,599 observations, and each record has the following attributes:

  1. Fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily).
  2. Volatile acidity: the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste.
  3. Citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines.
  4. Residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet.
  5. Chlorides: the amount of salt in the wine.
  6. Free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine.
  7. Total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.
  8. Density: the density of wine is close to that of water, depending on the percent alcohol and sugar content.
  9. pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale.
  10. Sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant.
  11. Alcohol: the percent alcohol content of the wine.

The following assumptions were made during the analysis:

  1. Fixed acidity, volatile acidity, citric acid, chlorides, free sulfur dioxide, total sulfur dioxide and sulphates in the data set are measured in grams/liter.
  2. Density in the data set is measured in grams per cubic centimeter.
  3. Domain knowledge is acquired from http://waterhouse.ucdavis.edu/whats-in-wine/red-wine-composition

As we progress through the document, we will:

  1. Analyze the range and distribution of individual attributes.
  2. Compare how related attributes influence each other.
  3. Examine how individual attributes affect the quality of wine.
  4. Finally, examine how related attributes, while influencing each other, jointly influence the quality of wine.

Data ingestion and high-level structural analysis

Let’s load the data from CSV format into an R data frame and inspect its structure:

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
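
The loading step can be sketched as follows. The report does not show its code or name the CSV file, so the temporary file and its three sample rows below are assumptions made only so that the sketch runs on its own.

```r
# Hypothetical stand-in for the real CSV: a few rows written to a temp file.
# The actual file name and location are not given in the report.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  '"","fixed.acidity","volatile.acidity","pH","alcohol","quality"',
  '"1",7.4,0.7,3.51,9.4,5',
  '"2",7.8,0.88,3.2,9.8,5',
  '"3",11.2,0.28,3.16,9.8,6'
), csv_path)

# read.csv() turns the empty first header into the column name "X"
wine <- read.csv(csv_path)

names(wine)  # column names
str(wine)    # column types and first values
```

An unnamed first column in the file would explain the row-index column X that appears in the output above.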

Although quality is loaded as type int, its values are discrete and fall within the range 0-10. We therefore convert quality to a factor (categorical variable).

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
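
A minimal sketch of the conversion, assuming base R's factor() was used (the report does not show the call). The vector below holds only the first ten quality scores from the output above, so it has fewer levels than the full data set.

```r
# First ten quality scores from the str() output (toy subset)
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# factor() creates one level per distinct value that actually occurs
quality_f <- factor(quality)

levels(quality_f)   # levels are stored as character labels
table(quality_f)    # number of wines per level
```

Because a factor only has levels for scores that occur, the full data set shows 6 levels ("3" to "8") rather than 11. Passing ordered = TRUE to factor() would additionally record the ranking of the scores, which is an option rather than something the report states.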

Now that we have fixed the data type of the columns, let’s check whether any column contains missing values.

## [1] FALSE

The result above confirms that there are no missing values, so we can proceed with analyzing the data.
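
One way to run such a check is sketched below on toy data. Whether the report used anyNA() or an equivalent such as any(is.na(...)) is not shown, so the call is an assumption.

```r
# Toy frame with one deliberately missing pH value
wine_toy <- data.frame(
  pH      = c(3.51, 3.20, NA),
  alcohol = c(9.4, 9.8, 9.8)
)

has_missing <- anyNA(wine_toy)               # single TRUE/FALSE for the whole frame
missing_per_column <- colSums(is.na(wine_toy))  # count of NAs in each column

has_missing
missing_per_column
```

The per-column counts are useful when the overall check returns TRUE and we need to know where the gaps are.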

Univariate Plotting and Analysis

We will use the ggplot2 library for univariate plotting.

Let’s start by analyzing the range and distribution of all quantitative attributes in the data set.

Fixed Acidity

From the summary below, fixed acidity in our data set ranges from 4.6 to 15.9.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

Now let’s analyze the frequency distribution of fixed acidity in the data set. The bar chart below shows that most of the wines have fixed acidity between 7 and 9, which is consistent with the interquartile range (7.1 to 9.2) in the summary.
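
The summary and frequency plot for a single attribute can be sketched as below. The data are only the first ten fixed acidity values shown earlier, and the bin width is an arbitrary choice for illustration, not a setting taken from the report.

```r
library(ggplot2)

# Toy subset: first ten fixed acidity values from the str() output
wine_toy <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
)

# Five-number summary plus mean, as in the report output
acidity_summary <- summary(wine_toy$fixed.acidity)
acidity_summary

# Frequency distribution; binwidth = 0.5 is an assumed value
p <- ggplot(wine_toy, aes(x = fixed.acidity)) +
  geom_histogram(binwidth = 0.5) +
  labs(x = "Fixed acidity", y = "Number of wines")
p
```

The same pattern applies to each of the other quantitative attributes by swapping the column name.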

Volatile Acidity

From the summary below, volatile acidity in our data set ranges from 0.12 to 1.58.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

The bar chart below depicts the frequency distribution of volatile acidity in the data set. A quick look reveals outliers in the distribution, which explains the relatively large gap between the 3rd quartile (0.64) and the maximum value (1.58).
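
To make the outlier remark concrete, the quartiles from the summary can be compared against the common 1.5 x IQR rule. That rule is an assumption here, since the report does not state how it identified outliers.

```r
# Quartiles and maximum copied from the volatile acidity summary above
q1 <- 0.39
q3 <- 0.64
max_value <- 1.58

iqr <- q3 - q1                  # interquartile range: 0.25
upper_fence <- q3 + 1.5 * iqr   # 1.5 x IQR rule: 0.64 + 0.375 = 1.015

# TRUE: the maximum lies well beyond the upper fence
is_outlier <- max_value > upper_fence
is_outlier
```

Under this rule, any volatile acidity value above about 1.015 would count as an outlier, so the maximum of 1.58 clearly qualifies.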

Citric Acid

From the summary below, citric acid measurements in our data set range from 0 to 1.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

The bar chart below depicts the frequency distribution of citric acid measurements in the data set. A quick look reveals outliers in the distribution, which explains the relatively large gap between the 3rd quartile (0.42) and the maximum value (1.00).

Residual Sugar

From the summary below, residual sugar measurements (grams per liter) in our data set range from 0.9 to 15.5.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

The bar chart below depicts the frequency distribution of residual sugar measurements in the data set. A quick look reveals outliers in the distribution, which explains the relatively large gap between the 3rd quartile (2.6) and the maximum value (15.5).

Chlorides

From the summary below, chloride measurements in our data set range from 0.012 to 0.611.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

The bar chart below depicts the frequency distribution of chloride measurements in the data set. A quick look reveals outliers in the distribution, which explains the relatively large gap between the 3rd quartile (0.090) and the maximum value (0.611).

Free Sulfur Dioxide

From the summary below, free sulfur dioxide measurements (parts per million) in our data set range from 1 to 72.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

The bar chart below depicts the frequency distribution of free sulfur dioxide measurements in the data set. A quick look reveals outliers in the distribution, which explains the relatively large gap between the 3rd quartile (21) and the maximum value (72).

Total Sulfur Dioxide

From the summary below, total sulfur dioxide measurements in our data set range from 6 to 289.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

The bar chart below depicts the frequency distribution of total sulfur dioxide measurements in the data set. A quick look reveals outliers in the distribution, which explains the relatively large gap between the 3rd quartile (62) and the maximum value (289).

Density

From the summary below, density measurements in our data set range from 0.9901 to 1.0037.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

The bar chart below shows that most of the wines in the data set have a density between 0.99 and 1, which is consistent with the interquartile range (0.9956 to 0.9978) in the summary.

pH

From the summary below, pH measurements in our data set range from 2.74 to 4.01.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

The bar chart below depicts the frequency distribution of pH measurements in the data set. A quick look reveals outliers in the distribution, which explains the gap between the 3rd quartile (3.40) and the maximum value (4.01).

Sulphates

From the summary below, sulphates measurements in our data set range from 0.33 to 2.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

The bar chart below depicts the frequency distribution of sulphates measurements in the data set. A quick look reveals outliers in the distribution, which explains the relatively large gap between the 3rd quartile (0.73) and the maximum value (2.00).

Alcohol

From the summary below, alcohol measurements in our data set range from 8.4 to 14.9.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

The bar chart below depicts the frequency distribution of alcohol measurements in the data set. A quick look reveals outliers in the distribution, which explains the relatively large gap between the 3rd quartile (11.1) and the maximum value (14.9).

Quality

The bar chart below depicts the number of wines at each quality level in the data set.
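
Since quality is a factor, its distribution is a simple count per level. A minimal sketch on the first ten quality scores follows; it uses base R's barplot() for brevity, whereas the report itself relies on ggplot2.

```r
# Toy subset: first ten quality scores, stored as a factor
quality <- factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5))

# Count of wines at each quality level
counts <- table(quality)
counts

# Base R bar chart of the counts
barplot(counts, xlab = "Quality", ylab = "Number of wines")
```

With ggplot2, geom_bar() on the quality column would give the equivalent chart.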

Bivariate Plotting and Analysis

So far we have analyzed the range and frequency distribution of each attribute in isolation, from a strictly statistical point of view. Going forward, let’s analyze the key attributes of wine and how they interact.

Note: All attributes are limited to their 99th percentile to exclude outliers.
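
A 99th percentile cutoff can be sketched as below. The report does not say whether values above the cutoff were dropped or clipped; this sketch drops them, and the data are synthetic.

```r
set.seed(1)  # synthetic data, fixed seed for reproducibility

# 99 ordinary values plus one extreme value standing in for an outlier
x <- c(rnorm(99, mean = 2.5, sd = 0.5), 15.5)

# 99th percentile of the attribute
cutoff <- quantile(x, probs = 0.99)

# Keep only the observations at or below the cutoff
x_trimmed <- x[x <= cutoff]

length(x) - length(x_trimmed)  # number of observations removed
```

In ggplot2 a similar effect on the plot alone can be obtained by setting axis limits at the same quantile, which leaves the underlying data untouched.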

Acidity is a fundamental property of wine, imparting sourness and resistance to microbial infection. (Doug Nierman, 2004)

Acids are major wine constituents and contribute greatly to its taste. Traditionally, total acidity is divided into two groups, namely the volatile acids and the nonvolatile or fixed acids. Fixed acids originate from grapes; the higher the fixed acidity, the sourer the wine will be.

Fixed acidity is inversely related to pH, which runs from 0 (very acidic) to 14 (very basic): the higher the fixed acidity, the lower the pH. The scatter plot below depicts this relationship.
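
The direction of this relationship can also be checked numerically with a correlation coefficient. The sketch below uses only the first ten rows shown in the str() output, so it illustrates the call rather than reproducing the result for the full data set.

```r
# Toy subset: first ten fixed acidity and pH values from the str() output
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
ph            <- c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35)

# Pearson correlation; a negative value means pH falls as acidity rises
r <- cor(fixed_acidity, ph)
r
```

A negative coefficient matches the downward trend expected in the scatter plot.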

Citric acid, found in small quantities, can add ‘freshness’ and flavor to wines. Adding citric acid increases the overall fixed acidity.

Volatile acidity is primarily acetic acid generated during the fermentation process. Acetic acid results in the concomitant formation of other, sometimes unpleasant, aroma compounds. One preventive method for reducing the amount of acetic acid is to add sulphates during fermentation.

The scatter plots below depict how volatile acidity varies with the sulphur content (in any form) of the wine. Even for the same sulphur content, volatile acidity varies over a wide range, which suggests that other factors (which may or may not be captured in the data set) influence volatile acidity.

Now that we have analyzed how related attributes influence each other, we continue by plotting and analyzing how each attribute influences the quality of the wine. We use box plots for this analysis.

The box plot below depicts the distribution of fixed acidity for each of the quality levels. From the plot:

  1. High fixed acidity values lead to lower quality wines.
  2. For wines with quality 5, the range of fixed acidity is high.
  3. For high quality wines, the median fixed acidity is slightly above 8.
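
A box plot of one attribute per quality level can be sketched with ggplot2 as follows. The data are the first ten rows shown earlier, so only three quality levels appear, and the labels are illustrative choices.

```r
library(ggplot2)

# Toy subset: quality must be a factor so that each level gets its own box
wine_toy <- data.frame(
  quality       = factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)),
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
)

# One box per quality level, attribute on the y axis
p <- ggplot(wine_toy, aes(x = quality, y = fixed.acidity)) +
  geom_boxplot() +
  labs(x = "Quality", y = "Fixed acidity")
p
```

Converting quality to a factor earlier is what makes this grouping work; with an integer column, ggplot2 would draw a single box unless a group is specified.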

The box plot below depicts the distribution of volatile acidity for each of the quality levels. From the plot:

  1. Higher levels of volatile acidity lead to lower quality wines.
  2. For wines with quality 5, the range of volatile acidity is high.
  3. For higher quality wines (> 6 rating), the median volatile acidity of the quality group is less than the overall median volatile acidity.
  4. Ignoring outliers, we can say that higher quality wines (> 7 rating) have volatile acidity levels below 0.8 g/L.
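
Comparisons between group medians and the overall median, as in point 3, can be computed directly. The sketch below uses tapply() on the first ten rows shown earlier; the numbers it yields describe that toy subset only.

```r
# Toy subset: first ten quality scores and volatile acidity values
wine_toy <- data.frame(
  quality          = factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50)
)

# Median over all wines
overall_median <- median(wine_toy$volatile.acidity)

# Median within each quality level
group_medians <- tapply(wine_toy$volatile.acidity, wine_toy$quality, median)

# TRUE where a quality group's median is below the overall median
below_overall <- group_medians < overall_median
below_overall
```

The same pattern works for any attribute by changing the column passed to median() and tapply().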

The box plot below depicts the distribution of citric acid for each of the quality levels. From the plot:

  1. Higher levels of citric acid content lead to higher quality wines.
  2. For wines with quality 5, the range of citric acid is high.
  3. For higher quality wines (> 6 rating), the median citric acid of the quality group is greater than or equal to the overall median citric acid level.

The box plot below depicts the distribution of residual sugar for each of the quality levels. From the plot:

  1. Higher levels of residual sugar content lead to higher quality wines.
  2. Quality levels 5 and 6 contain a large number of outliers.
  3. All quality levels have a median residual sugar content approximately equal to the overall median residual sugar level.

The box plot below depicts the distribution of sulphates for each of the quality levels. From the plot:

  1. Higher levels of sulphates lead to higher quality wines.
  2. For wines with quality 5, the range of sulphates is high.
  3. For the highest quality wines (8 rating), the minimum sulphates content is greater than the overall median sulphates content (0.62).

The box plot below depicts the distribution of pH levels for each of the quality levels. From the plot:

  1. Higher pH levels lead to higher quality wines.
  2. For wines with quality 6, the range of pH is high.
  3. For higher quality wines (> 6 rating), median pH is approximately 3.25.

The box plot below depicts the distribution of alcohol percentage for each of the quality levels. From the plot:

  1. Higher alcohol percentage leads to higher quality wines.
  2. For wines with quality 5, the range of alcohol percentage is high.
  3. For higher quality wines (> 6 rating), the median alcohol content of the quality group is greater than the overall median alcohol content.

Multivariate Plotting and Analysis

So far we have analyzed attributes of wine in isolation and, in detail, how attributes within the same class affect each other. Now let’s analyze how grouped attributes influence the quality of the wine.

Just like any other business, the goal of producing red wines is to produce high quality wines.

We will keep alcohol content as the primary parameter, i.e., for a given alcohol level, we examine how other attributes influence the quality of the wine.

Final plots and Summary

Along with plotting the values from the data set, we model the data using the stat_smooth function. We let the function select the method and formula automatically, with a 95% confidence level. This is a good starting point for visually identifying trends.
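
A minimal sketch of such a smoothed plot is shown below on synthetic data. The variable names mirror the data set, but the values and the relationship between them are invented for illustration, and any grouping by quality used in the report is not reproduced here.

```r
library(ggplot2)

set.seed(42)  # synthetic data, fixed seed for reproducibility

# Invented values: alcohol within the observed range, sugar loosely tied to it
toy <- data.frame(alcohol = runif(60, min = 8.4, max = 14.9))
toy$residual.sugar <- 2 + 0.1 * toy$alcohol + rnorm(60, sd = 0.3)

# Points plus a smoother; method and formula are left at their defaults,
# and level = 0.95 draws a 95% confidence band around the fitted curve
p <- ggplot(toy, aes(x = alcohol, y = residual.sugar)) +
  geom_point(alpha = 0.5) +
  stat_smooth(level = 0.95) +
  labs(x = "Alcohol (%)", y = "Residual sugar")
p
```

With the default automatic method, ggplot2 picks the smoother based on the number of observations and reports its choice in a message when the plot is drawn.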

Alcohol vs Residual Sugar

The next couple of plots show how quality is influenced by alcohol and residual sugar, along with a model to predict the potential quality range for given alcohol and residual sugar values. Higher sugar levels add sweetness to the wine.

From the above plots we can deduce:

  1. More data yields a better model. For wines of quality 5 and 6 we have much more data than for the other quality buckets, which produces a better model.
  2. Wines with low alcohol and sugar levels tend to be of lower quality.

Alcohol vs Citric Acid

The next couple of plots show how quality is influenced by alcohol and citric acid, along with a model to predict the potential quality range for given alcohol and citric acid values. Citric acid adds freshness and flavor to the wine. Not all wines contain citric acid, so for better modeling we ignore the wines with no citric acid content.
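
Excluding wines without citric acid amounts to a simple row filter. A sketch on the first ten rows shown earlier, assuming that "no citric acid" means a recorded value of exactly 0:

```r
# Toy subset: first ten citric acid and alcohol values from the str() output
wine_toy <- data.frame(
  citric.acid = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  alcohol     = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# Keep only wines with a non-zero citric acid measurement
with_citric <- subset(wine_toy, citric.acid > 0)

nrow(with_citric)  # number of wines kept
```

In this toy subset half of the rows are dropped, which shows how much such a filter can shrink the data available to the model.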

From the above plots we can deduce:

  1. More data yields a better model. For wines of quality 5 and 6 we have much more data than for the other quality buckets, which produces a better model.
  2. Wines with lower citric acid content are still rated as higher quality if the alcohol content is higher.

Alcohol vs Volatile Acidity

The next couple of plots show how quality is influenced by alcohol and volatile acidity, along with a model to predict the potential quality range for given alcohol and volatile acidity values. Higher volatile acidity levels lead to an unpleasant aroma.

From the above plots we can deduce:

  1. More data yields a better model. For wines of quality 5 and 6 we have much more data than for the other quality buckets, which produces a better model.
  2. Wines with lower volatile acidity are still rated as lower quality if the alcohol content is low.

Reflections

The red wine data set is relatively small, with only 1,599 observations of 11 attributes. The data set is very clean and did not require any data wrangling.

Because the data set was small and had no complete documentation, the domain knowledge needed to analyze the data was acquired from http://waterhouse.ucdavis.edu/whats-in-wine/red-wine-composition.

Even though the document mentions how greatly volatile acidity and citric acid influence the quality of wine, the data set did not support the hypothesis. For both volatile acidity and citric acid even

Even though the document mentions that higher volatile acidity causes a bad aroma, wines with high alcohol content are still rated as high quality. Wines with lower citric acid content are also rated as higher quality if the alcohol content is higher.

The following assumptions were made during the analysis:

  1. Fixed acidity, volatile acidity, citric acid, chlorides, free sulfur dioxide, total sulfur dioxide and sulphates in the data set are measured in grams/liter.
  2. Density in the data set is measured in grams per cubic centimeter.
